Skip to content

BUG: MultiIndex intersection with sort=False does not preserve order #31312

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 12, 2020

Conversation

jeffzi
Copy link
Contributor

@jeffzi jeffzi commented Jan 25, 2020

The intersection of 2 MultiIndex with sort=False does not preserve the order, whereas Index. intersection() does. This behavior does not seem to be intentional since the tests for difference and symmetric_difference are testing for order preservation.

For example:

import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

left = idx[2::-1]
print(left)
#> MultiIndex([('baz', 'one'),
#>             ('bar', 'two'),
#>             ('bar', 'one')],
#>            names=['first', 'second'])
right = idx[:5]
print(right)
#> MultiIndex([('bar', 'one'),
#>             ('bar', 'two'),
#>             ('baz', 'one'),
#>             ('baz', 'two'),
#>             ('foo', 'one')],
#>            names=['first', 'second'])

# expected same order as left
intersect = left.intersection(right, sort=False)
print(intersect)
#> MultiIndex([('bar', 'two'),
#>             ('baz', 'one'),
#>             ('bar', 'one')],
#>            names=['first', 'second'])

Created on 2020-01-25 by the reprexpy package

I verified that my PR does not decrease performances.

I also modified the test for union. The implementation was preserving the order with sort=False but the tests were not verifying it.

@charlesdong1991
Copy link
Member

Very nice! @jeffzi

could you pls also create an issue and reference this PR to the issue?

@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch from 7a066e6 to 8320d41 Compare January 26, 2020 10:08
@jeffzi
Copy link
Contributor Author

jeffzi commented Jan 26, 2020

I benchmarked the intersection of 2 arrays of tuples: master Vs this PR Vs suggestion by @jreback to take inspiration of difference.

I also looked at the implementation of Index.intersection and found that performances can be increased in the case of 2 monotonic MultiIndex.

@jreback If that's ok with you, I can commit the "mixed" version below.

iimport numpy as np
import pandas as pd
from pandas.testing import assert_index_equal

def master(self, other):
    return set(self._ndarray_values) & set(other._ndarray_values)

def pr(self, other):
    lvals = self._ndarray_values
    rvals = other._ndarray_values
    
    other_uniq = set(rvals)
    seen = set()
    return [x for x in lvals if x in other_uniq and not (x in seen or seen.add(x))]

def indexer(self, other):
    lvals = self._ndarray_values

    if self.is_monotonic and other.is_monotonic:
        rvals = other._ndarray_values
        return self._inner_indexer(lvals, rvals)[0]

    other_uniq = other.unique()
    indexer = other_uniq.get_indexer(lvals)
    indexer = indexer.take((indexer != -1).nonzero()[0])

    return other_uniq.take(indexer)._ndarray_values

def mixed(self, other):
    lvals = self._ndarray_values
    rvals = other._ndarray_values

    if self.is_monotonic and other.is_monotonic:
        return self._inner_indexer(lvals, rvals)[0]

    other_uniq = set(rvals)
    seen = set()
    return [x for x in lvals if x in other_uniq and not (x in seen or seen.add(x))]

size = 100000 
left = pd.MultiIndex.from_arrays([np.arange(1, size), np.arange(1, size)])
right = pd.MultiIndex.from_arrays([np.arange(1, size//10)[::-1], np.arange(1, size//10)[::-1]])
right_monotonic = pd.MultiIndex.from_arrays([np.arange(1, size//10), np.arange(1, size//10)])

for r in [right, right_monotonic]:
    print(f"\nBoth monotonic: {r.is_monotonic}\n---------------")
    expected = pd.MultiIndex.from_tuples(pr(left, r))
    actual_indexer = pd.MultiIndex.from_tuples(indexer(left, r))
    assert_index_equal(expected, actual_indexer)
    actual_mixed = pd.MultiIndex.from_tuples(mixed(left, r))
    assert_index_equal(expected, actual_mixed)

    print("master implementation:")
    %timeit -n 10 master(left, r)
    print("pr implementation:")
    %timeit -n 10 pr(left, r)
    print("Suggested indexer implementation:")
    %timeit -n 10 indexer(left, r)
    print("Mixed implementation:")
    %timeit -n 10 mixed(left, r)
Both monotonic: False
---------------
master implementation:
18.2 ms ± 5.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pr implementation:
10.6 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Suggested indexer implementation:
170 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mixed implementation:
10.2 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Both monotonic: True
---------------
master implementation:
15.1 ms ± 984 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
pr implementation:
10.4 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Suggested indexer implementation:
2.56 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mixed implementation:
2.63 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
'platform': 'Darwin',
'platform-release': '19.2.0',
'architecture': 'x86_64',
'ram': '16 GB'

@jeffzi jeffzi changed the title Bug: MultiIndex intersection with sort=False does not preserve order BUG: MultiIndex intersection with sort=False does not preserve order Jan 26, 2020
@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch from 83518d1 to 5273652 Compare January 27, 2020 18:30
@jeffzi jeffzi requested a review from jreback January 28, 2020 09:08
@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch 3 times, most recently from 632f23e to 6b8f2d7 Compare February 4, 2020 09:20
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 9, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you add a whatsnew note in bugfixes for MI. ping on green.

@jreback
Copy link
Contributor

jreback commented Feb 9, 2020

pls merge master as well

@jreback jreback requested a review from jbrockmendel February 9, 2020 21:54
@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch from 74f5fde to 7fb060d Compare February 10, 2020 11:32
@jeffzi
Copy link
Contributor Author

jeffzi commented Feb 10, 2020

@jreback All green :)

@jreback jreback added this to the 1.1 milestone Feb 10, 2020
@jreback
Copy link
Contributor

jreback commented Feb 10, 2020

lgtm. @jbrockmendel

@jbrockmendel
Copy link
Member

Couple of comments/questions, no deal-breakers. looks nice

@WillAyd WillAyd merged commit c2f3ce3 into pandas-dev:master Feb 12, 2020
@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2020

Thanks @jeffzi great first PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: MultiIndex intersection with sort=False does not preserve order
5 participants